Exploratory Data Analysis (EDA) is one of the fundamental steps in any data science process. It allows us to understand the structure, detect anomalies, and uncover patterns in the data before modeling.
“Without EDA, you’re not doing data science, you’re just guessing.”
EDA combines statistics, programming, and visualization to explore datasets. This report is designed to help you practice these core skills using real-world data.
1.1 Dataset
We will use the movies dataset from vega-datasets, which includes information about thousands of films such as their ratings, genres, duration, and box office revenue.
Let’s load and preview the dataset:
Code
import pandas as pd
import altair as alt
from vega_datasets import data

# Load dataset
movies = data.movies()

# Show first rows
movies.head()
| | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | None | None | None | None | NaN | 6.1 | 1071.0 |
| 1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | None | Drama | None | None | NaN | 6.9 | 207.0 |
| 2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | None | NaN | Lionsgate | None | Comedy | None | None | NaN | 6.8 | 865.0 |
| 3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | None | NaN | Fine Line | None | Comedy | None | None | 13.0 | NaN | NaN |
| 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | None | 62.0 | 3.4 | 165.0 |
Now, let’s examine the shape (number of rows and columns) of the dataset:
Code
movies.shape
(3201, 16)
The dataset has 3,201 entries (rows) and 16 features (columns).
1.2 First Steps
Before diving deeper into the data, it’s useful to explore some key metadata:
✅ The column names and their data types
⚠️ The presence of missing values
📊 Summary statistics for numeric columns
1.2.1 Column Names and Data Types
Understanding the structure of the dataset helps us know what type of data we’re dealing with.
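A minimal sketch of this inspection. The frame `sample` is a hypothetical stand-in holding a few of the movies columns so that the snippet runs on its own; on the real data the same attributes would be read from `movies`.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the movies data
sample = pd.DataFrame({
    "Title": ["The Land Girls", "Slam"],
    "US_Gross": [146083.0, 1009819.0],
    "MPAA_Rating": ["R", "R"],
    "IMDB_Votes": [1071.0, 165.0],
})

# Column names as a plain list
column_names = sample.columns.tolist()

# Data type of each column (text columns vs. numeric columns)
column_types = sample.dtypes
print(column_types)
```

Text columns such as Title show up with a non-numeric dtype, which is a quick hint about which columns are categorical and which are quantitative.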
Detecting and handling missing values is a critical step in any EDA process. Missing data can bias analysis or break downstream models if not handled properly. Examining missingness helps us:
Detect patterns in missingness
Identify if some columns are almost entirely null
Decide whether to drop or impute certain variables
1.3.1 Percentage of Missing Values per Column
Let’s start by computing the percentage of missing values in each column:
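One possible way to compute this, sketched on a small hypothetical frame (`sample`) whose values mirror the first rows previewed earlier. The variable name `missing_pct` is an assumption for illustration; on the real data you would apply the same expression to `movies`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; values mirror some of the rows previewed earlier
sample = pd.DataFrame({
    "Title": ["The Land Girls", "First Love, Last Rites",
              "I Married a Strange Person", "Slam"],
    "US_DVD_Sales": [np.nan, np.nan, np.nan, np.nan],
    "MPAA_Rating": ["R", "R", None, "R"],
    "IMDB_Rating": [6.1, 6.9, 6.8, 3.4],
})

# isna() flags missing cells; the mean of those booleans is the missing fraction
missing_pct = (sample.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

Sorting in descending order puts the columns that are almost entirely null at the top, which makes candidates for dropping easy to spot.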
Understanding the nature of variables and the relationships between them is central to Exploratory Data Analysis (EDA). Depending on the number and type of variables involved, we can classify analysis into three main categories: univariate, bivariate, and multivariate.
Table 1 shows how EDA analysis types vary depending on the number and types of variables involved.
This classification helps guide the selection of appropriate visualization techniques and statistical methods for each case.
Table 1: Types of analysis and variable combinations used in EDA
| Analysis Type | Variable Types | Description | Examples |
|---|---|---|---|
| Univariate | Categorical | One qualitative variable | Gender, Product Category |
| Univariate | Quantitative | One numerical variable | Income, Age, Runtime |
| Bivariate | Categorical – Categorical | Two qualitative variables | Gender vs Nationality |
| Bivariate | Categorical – Quantitative | One qualitative and one numerical | Province vs Population |
| Bivariate | Quantitative – Quantitative | Two numerical variables | Age vs Income |
| Multivariate | 3 or more variables (any mix) | Combination of categorical and/or numerical | Age vs Income by Gender, etc. |
2.1 Univariate Analysis: Quantitative
A univariate analysis focuses on examining a single numeric variable to understand its distribution, shape, central tendency, and spread. One of the most common tools for this is the histogram.
In this case, we’ll explore the distribution of the movie runtime (Running_Time_min).
2.1.1 Basic Histogram
We start by creating a histogram to visualize the distribution of running times:
Code
alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('Running_Time_min', bin=alt.Bin(maxbins=30)),
    alt.Y('count()')
).properties(
    title='Histogram of Movie Runtimes (30 bins)'
)
This chart shows how many movies fall into each time interval (bin). However, histograms can look quite different depending on the number and size of bins used.
2.1.2 Effect of Bin Size
Let’s compare how the histogram shape changes with different bin sizes:
Even though both plots use the same data, the choice of bin size changes the visual interpretation. A small number of bins may hide details, while too many bins can make it harder to spot trends.
2.1.3 Density plots, or Kernel Density Estimate (KDE)
Density plots offer a smoothed alternative to histograms. Instead of using rectangular bins to count data points, they estimate the probability density function by placing bell-shaped curves (kernels) at each observation and summing them.
This approach helps reduce the visual noise and jaggedness that can occur in histograms and gives a clearer picture of the underlying distribution.
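To make the idea of "one kernel per observation" concrete, here is a minimal NumPy sketch. The function name `gaussian_kde_sketch`, the runtimes, and the bandwidth of 8 minutes are all assumptions for illustration. The kernels are averaged rather than summed so that the resulting curve has a total area of 1.

```python
import numpy as np

def gaussian_kde_sketch(observations, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on `grid`."""
    obs = np.asarray(observations, dtype=float)[:, None]      # shape (n, 1)
    z = (np.asarray(grid, dtype=float)[None, :] - obs) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=0)  # averaging keeps the total area at 1

runtimes = [88, 92, 95, 101, 104, 118, 142]  # hypothetical runtimes (minutes)
grid = np.linspace(40, 200, 801)
density = gaussian_kde_sketch(runtimes, grid, bandwidth=8.0)

# Riemann-sum approximation of the area under the estimated density
area = float(np.sum(density) * (grid[1] - grid[0]))
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely, which parallels the bin-size trade-off in histograms.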
We can also compare distributions across groups by splitting the KDE by a categorical variable using the groupby parameter. This helps us see how the distribution differs between categories, such as genres.
The transparency (opacity=0.5) allows us to observe overlapping distributions and ensures that small density areas are not completely hidden behind larger ones.
From this plot, we can observe, for example, that Drama movies have runtimes nearly as long as the longest Adventure movies, even though their overall distributions differ.
2.2 Bivariate Analysis: Categorical vs Quantitative
Bivariate analysis examines the relationship between two variables. In this case, we focus on one categorical variable (e.g., genre) and one quantitative variable (e.g., revenue), which is a very common scenario in exploratory data analysis.
This type of analysis is useful to:
- Compare average or median values across categories.
- Detect outliers or high-variance groups.
- Understand distributional differences across categories.
Below are several effective visualizations for this analysis.
2.2.1 Basic Barchart
Bar charts are effective for comparing aggregated values (like the mean) across different groups. However, they hide the distribution and variation within each group.
Code
alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y("Major_Genre")
).properties(
    title="Average Worldwide Gross by Genre"
)
This bar chart shows the mean Worldwide Gross per genre. It is useful for identifying which genres gross more on average (revenue, not profit, since budgets are ignored), but it does not show how spread out the data is.
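As an illustration of what a mean can hide, the sketch below uses hypothetical grosses in which two genres have the same mean but very different distributions. The numbers are invented for the example and are not taken from the movies dataset.

```python
import pandas as pd

# Hypothetical grosses; replace with movies_cleaned on the real data
sample = pd.DataFrame({
    "Major_Genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Comedy"],
    "Worldwide_Gross": [1.0e6, 2.0e6, 30.0e6, 10.0e6, 11.0e6, 12.0e6],
})

# The mean is what the bar chart draws; median and std describe what it hides
summary = sample.groupby("Major_Genre")["Worldwide_Gross"].agg(
    ["mean", "median", "std"]
)
print(summary)
```

Both genres would get bars of equal length, yet the Drama mean is driven by a single large value while the Comedy values sit close together.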
2.2.2 Tick Plot
To visualize individual data points, we use a tick plot. This helps uncover variability within genres and detect outliers.
Code
alt.Chart(movies_cleaned).mark_tick().encode(
    alt.X('Worldwide_Gross'),
    alt.Y("Major_Genre"),
    alt.Tooltip('Title:N')
).properties(
    title="Individual Gross per Movie by Genre"
)
2.2.3 Heatmaps
Heatmaps can summarize the frequency of data points across both axes (quantitative and categorical) using color intensity. They are particularly useful for spotting patterns without getting overwhelmed by individual points.
Code
alt.Chart(movies_cleaned).mark_rect().encode(
    alt.X('Worldwide_Gross', bin=alt.Bin(maxbins=100)),
    alt.Y("Major_Genre"),
    alt.Color('count()'),
    alt.Tooltip('count()')
).properties(
    title="Heatmap of Movie Counts by Gross and Genre"
)
This heatmap shows how frequently movies from each genre fall into different revenue ranges.
2.2.4 Boxplot
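Boxplots are useful for comparing distributions across categories and identifying outliers. They summarize a distribution using five statistics:
Median (Q2)
First Quartile (Q1)
Third Quartile (Q3)
Lower Whisker (smallest observation at or above Q1 - 1.5 × IQR)
Upper Whisker (largest observation at or below Q3 + 1.5 × IQR)
Here IQR = Q3 - Q1 is the interquartile range; observations beyond the whiskers are drawn as individual outlier points.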
Code
alt.Chart(movies_cleaned).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y("Major_Genre")
).properties(
    title="Boxplot of Worldwide Gross by Genre"
)
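The statistics behind a boxplot can also be computed by hand. The sketch below uses a short hypothetical series and pandas' default (linear-interpolation) quantiles; the charting library may use a slightly different quantile method, so small numeric differences from the plotted box are possible.

```python
import pandas as pd

# Hypothetical values; 30 is a deliberate outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences sit 1.5 x IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the fences
lower_whisker = values[values >= lower_fence].min()
upper_whisker = values[values <= upper_fence].max()
outliers = values[(values < lower_fence) | (values > upper_fence)].tolist()
```

In this example the upper whisker stops at 9 and the value 30 is reported separately as an outlier.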
2.2.5 Side-by-side: Boxplot and Bar Chart
To contrast aggregated values (bar chart) with the full distribution (boxplot), we can display them together:
Code
bar = alt.Chart(movies_cleaned).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y("Major_Genre")
)

# The boxplot needs the raw values, not the mean, to show the full distribution
box = alt.Chart(movies_cleaned).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y("Major_Genre")
)

box | bar
This comparison reveals whether the mean is a good representative of the genre, or whether the data is skewed or contains outliers that affect the average.
2.3 Bivariate Analysis: Quantitative vs Quantitative
When analyzing two quantitative (numerical) variables simultaneously, we aim to discover possible relationships, trends, or correlations. This type of bivariate analysis can reveal whether increases in one variable are associated with increases or decreases in another (positive or negative correlation), or if there’s no relationship at all. The most common and intuitive visualization for this is the scatterplot.
2.3.1 Scatterplots
Scatter plots are effective visualizations for exploring two-dimensional distributions, allowing us to identify patterns, trends, clusters, or outliers.
Let’s start by visualizing how movies are rated across two popular online platforms:
Are movies rated similarly on different platforms?
Code
alt.Chart(movies_cleaned).mark_point().encode(
    alt.X('IMDB_Rating'),
    alt.Y('Rotten_Tomatoes_Rating')
).properties(
    title="IMDB vs Rotten Tomatoes Ratings"
)
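A scatterplot can be complemented with a numeric measure of association. The sketch below uses hypothetical ratings (not values from the dataset) and computes the Pearson correlation, plus a rank-based (Spearman-style) correlation obtained by correlating the ranks.

```python
import pandas as pd

# Hypothetical ratings; IMDB uses a 0-10 scale, Rotten Tomatoes a 0-100 scale
ratings = pd.DataFrame({
    "IMDB_Rating": [3.4, 5.2, 6.1, 6.8, 6.9, 7.5, 8.1],
    "Rotten_Tomatoes_Rating": [13, 35, 48, 62, 70, 81, 93],
})

# Pearson measures linear association between the two columns
pearson = ratings["IMDB_Rating"].corr(ratings["Rotten_Tomatoes_Rating"])

# Correlating the ranks only uses the ordering of the values
spearman = ratings["IMDB_Rating"].rank().corr(
    ratings["Rotten_Tomatoes_Rating"].rank()
)
```

Because both columns increase together in this toy data, the rank correlation is 1 even though the relationship is not perfectly linear.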
2.3.2 Scatterplot Saturation
Scatterplots can become saturated when too many points overlap in a small area of the chart, making it difficult to distinguish dense regions from sparse ones. For example, when plotting financial variables like production budget versus worldwide gross:
Code
saturated = alt.Chart(movies_cleaned).mark_point().encode(
    alt.X('Production_Budget'),
    alt.Y('Worldwide_Gross')
).properties(
    title="Saturated Scatterplot: Budget vs Gross"
)

saturated
2.3.3 Using Binned Heatmap to Reduce Saturation
To address saturation, we can bin both variables and use a heatmap where the color intensity represents the number of movies that fall into each rectangular region of the grid. This makes dense areas more interpretable.
Compare the raw scatterplot with the heatmap representation:
Code
saturated | heatmap_scatter
2.4 Bivariate Analysis: Categorical vs Categorical
When working with two categorical variables, bivariate analysis helps us understand how categories from one variable relate or are distributed across the other. For example, we might want to know how different movie genres are rated according to the MPAA rating system. Visualization techniques like grouped bar charts and faceted plots can reveal patterns, associations, or class imbalances.
2.4.1 Basic Faceted Bar Chart
We begin by exploring how movies are rated (MPAA_Rating) across different genres (Major_Genre). A faceted bar chart allows us to visualize this relationship by plotting a bar chart per genre, helping to identify genre-specific rating distributions.
Faceting horizontally can make comparisons across genres harder when the x-axis is misaligned. By specifying columns=1, we lay out the facets vertically, making it easier to compare counts across genres.
By default, facet plots share the same x-axis scale (dependent scale), which allows for easier comparison across panels. However, when the number of observations varies greatly between genres, this shared scale can compress some charts.
We can instead use independent x-axis scaling for each facet. This highlights the relative distribution within each genre.
The left panel (shared scale) supports absolute comparisons between genres, while the right panel (independent scale) makes within-genre comparisons more readable.
2.4.4 Heatmaps
Heatmaps are effective for visualizing the relationship between two categorical variables when the goal is to display counts or frequency of occurrences. They map the number of observations to color, providing an intuitive view of which category pairs are most or least common.
We can enhance this basic representation by also using marker size, combining both color intensity and circle area to represent counts more effectively. This dual encoding can improve interpretation, especially when printed in grayscale or when there are subtle color differences.
Code
heatmap_color = alt.Chart(movies_cleaned).mark_rect().encode(
    alt.X('MPAA_Rating'),
    alt.Y('Major_Genre', sort='color'),
    alt.Color('count()')
).properties(
    title="Heatmap with Color (Count of Movies)"
)

heatmap_size = alt.Chart(movies_cleaned).mark_circle().encode(
    alt.X('MPAA_Rating'),
    alt.Y('Major_Genre', sort='color'),
    alt.Color('count()'),
    alt.Size('count()')
).properties(
    title="Heatmap with Color + Size (Count of Movies)"
)

heatmap_color | heatmap_size
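The counts that such a heatmap encodes can be inspected directly as a contingency table. This sketch uses `pd.crosstab` on a hypothetical stand-in frame (`sample`) rather than `movies_cleaned`.

```python
import pandas as pd

# Hypothetical stand-in for movies_cleaned
sample = pd.DataFrame({
    "Major_Genre": ["Drama", "Drama", "Drama", "Comedy", "Comedy", "Horror"],
    "MPAA_Rating": ["R", "R", "PG-13", "PG", "PG-13", "R"],
})

# Rows are genres, columns are ratings, cells are the number of movies
counts = pd.crosstab(sample["Major_Genre"], sample["MPAA_Rating"])
print(counts)
```

Each cell of the table corresponds to one rectangle (or circle) in the heatmap, and category pairs that never occur appear as zeros.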
2.5 Multivariate Analysis
Multivariate analysis helps us understand the interactions and relationships among multiple variables simultaneously. In the context of numerical features, it is useful to explore pairwise distributions, correlations, and detect potential clusters or anomalies.
When the number of variables is large, repeated charts such as histograms or scatter plot matrices help us summarize patterns efficiently and consistently across all numerical dimensions.
2.5.1 Repeated Histograms for Numerical Columns
We first identify and isolate all numerical columns from the dataset. Then we repeat a histogram for each of these columns to understand the individual distributions. This overview is helpful to detect skewness, outliers, or binning decisions that affect how data is grouped visually.
Code
# Select only numerical columns
numerical_columns = movies_cleaned.select_dtypes('number').columns.tolist()
When scatter plots become too saturated (many overlapping points), heatmaps offer a better alternative by binning the numeric values and encoding the count in color intensity.
To gain deeper insights into the dataset, it’s important to analyze how numerical variables behave across different categories. This type of multivariate analysis allows us to:
Compare distributions across categories
Detect outliers within categories
Observe central tendency (median, quartiles) and spread (range, IQR)
Boxplots are particularly effective for this purpose. In the following visualizations, we explore these relationships by repeating plots across combinations of categorical and numerical features.
2.5.4 Filter Categorical Columns
First, we select the relevant categorical columns, excluding identifiers and text-heavy variables like movie titles or director names.
Code
categorical_columns = movies_cleaned.select_dtypes('object').columns.to_list()
categorical_columns_remove = ['Title', 'Release_Date', 'Distributor', 'Director']
categorical_filtered = [col for col in categorical_columns if col not in categorical_columns_remove]
2.5.5 Repeated Boxplots: Categorical vs Numerical
We repeat boxplots using combinations of categorical (rows) and numerical (columns) features. This matrix layout gives a clear visual overview of how numerical values are distributed within each category.
For more focused analysis, we can facet the boxplots using a specific categorical variable like MPAA_Rating, and repeat the chart by different categorical rows. This lets us keep the numerical axis fixed while comparing how categories vary across different classes (e.g., movie ratings).